Using likelihood L-statistics to measure confidence in audio-visual speech recognition
Abstract
This paper describes recent work on decision fusion in audio-visual speech recognition. A novel approach is proposed for combining audio and video channel information in an audio-visual speech recognition scenario. We consider the frame-level phonetic classification problem using two single-stream Gaussian mixture models (GMMs). The audio and video streams are adaptively weighted using the cumulative mean of the sample confidence values over past frames, in addition to the confidence value of the present sample. The confidence values for the audio and video decisions are computed as an L-statistic (a linear combination of order statistics) of the log-likelihoods against the phone models. Experiments on a database of about 15,000 sentences of large-vocabulary continuous speech show that the proposed approach yields better classification accuracy than other approaches.
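The weighting scheme described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the L-statistic coefficients, the rule for mixing past and present confidence, or the mapping from confidence to stream weight, so everything below is an assumption made for illustration: the coefficients are "best log-likelihood minus the mean of the rest", the hypothetical parameter `alpha` blends the present confidence with the cumulative mean over past frames, and the audio weight is the normalized audio confidence. The function names are likewise hypothetical.

```python
import numpy as np


def top_vs_rest_coeffs(n_phones):
    """Assumed coefficients: +1 on the best score, -1/(n-1) on each of the others."""
    coeffs = np.full(n_phones, -1.0 / (n_phones - 1))
    coeffs[0] = 1.0
    return coeffs


def l_statistic(log_likelihoods, coeffs):
    """Linear combination of order statistics (scores sorted in descending order)."""
    ordered = np.sort(np.asarray(log_likelihoods, dtype=float))[::-1]
    return float(np.dot(coeffs, ordered))


def fuse_frames(audio_ll, video_ll, coeffs=None, alpha=0.5):
    """Frame-level decision fusion of two streams of per-phone log-likelihoods.

    audio_ll, video_ll: arrays of shape (n_frames, n_phones).
    Returns (decisions, audio_weights), one entry per frame.
    """
    audio_ll = np.asarray(audio_ll, dtype=float)
    video_ll = np.asarray(video_ll, dtype=float)
    n_frames, n_phones = audio_ll.shape
    if coeffs is None:
        coeffs = top_vs_rest_coeffs(n_phones)

    decisions = np.empty(n_frames, dtype=int)
    audio_weights = np.empty(n_frames)
    past_sum_a = past_sum_v = 0.0  # running sums of confidence over past frames

    for t in range(n_frames):
        conf_a = l_statistic(audio_ll[t], coeffs)
        conf_v = l_statistic(video_ll[t], coeffs)

        # Cumulative mean over past frames; fall back to the present value at t = 0.
        mean_a = past_sum_a / t if t > 0 else conf_a
        mean_v = past_sum_v / t if t > 0 else conf_v

        # Assumed smoothing rule: blend present confidence with the past mean.
        smooth_a = alpha * conf_a + (1.0 - alpha) * mean_a
        smooth_v = alpha * conf_v + (1.0 - alpha) * mean_v

        # Assumed weight rule: normalized confidence (both are >= 0 with these coeffs).
        total = smooth_a + smooth_v
        w_a = smooth_a / total if total > 0 else 0.5

        fused = w_a * audio_ll[t] + (1.0 - w_a) * video_ll[t]
        decisions[t] = int(np.argmax(fused))
        audio_weights[t] = w_a

        past_sum_a += conf_a
        past_sum_v += conf_v

    return decisions, audio_weights


if __name__ == "__main__":
    audio = [[0.0, -5.0, -5.0], [-5.0, 0.0, -5.0]]   # peaked: confident stream
    video = [[-1.0, -1.0, -1.0], [-1.0, -1.1, -1.0]]  # nearly flat: unreliable stream
    phones, weights = fuse_frames(audio, video)
    print(phones, weights)
```

With the default coefficients, a stream whose best phone score stands well clear of the others receives a high confidence, while a stream with nearly flat scores receives a confidence close to zero, so the fused decision follows the more discriminative stream.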
Similar resources
Using Likelihood L-statistic as Confidence Measure in Audio-visual Speech Recognition
This paper describes recent work on decision fusion in audio-visual speech recognition. In this work, a novel approach is proposed to combine audio and video channel information in an audio-visual speech recognition scenario. For simplicity, we have only considered the frame-level phonetic classification problem using two single-stream Gaussian Mixture Models (GMMs). Audio and video streams are adaptive...
Stream confidence estimation for audio-visual speech recognition
We investigate the use of single modality confidence measures as a means of estimating adaptive, local weights for improved audio-visual automatic speech recognition. We limit our work to the toy problem of audio-visual phonetic classification by means of a two-stream Gaussian mixture model (GMM), where each stream models the class conditional audio- or visual-only observation probability, raised...
Improving visual noise insensitivity in small vocabulary audio visual speech recognition applications
Visual noise insensitivity is important to audio visual speech recognition (AVSR). Visual noise can take on a number of forms such as varying frame rate, occlusion, lighting or speaker variabilities. In this paper the use of a high dimensional secondary classifier on the word likelihood scores from both the audio and video modalities is investigated for the purposes of adaptive fusion. Prelimin...
Using twin-HMM-based audio-visual speech enhancement as a front-end for robust audio-visual speech recognition
In this paper we propose the use of the recently introduced twin-HMM-based audio-visual speech enhancement algorithm as a front-end for audio-visual speech recognition systems. This algorithm determines the clean speech statistics in the recognition domain based on the audio-visual observations and transforms these statistics to the synthesis domain through the so-called twin HMMs. The adopted fr...
Correcting Korean vowel speech recognition errors with limited lip features
In the experiment, we evaluate the audio-only and the selected lip feature based visual-only speech recognition performances separately. For audio-only speech recognition, we use HTK3.2 [4], and adopt normalized log likelihood ratio (LLR) scores as the N-best confidence scores. For the visual part, we build a back propagation neural network by using SNNS 4.2 [5] based on the selected lip featur...